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Abstract 



What do auto-encoders learn about the underlying data generating distribution? Re- 
cent work suggests that some auto-encoder variants do a good job of capturing the 
local manifold structure of data. This paper clarifies some of these previous intuitive 
observations by showing that minimizing a particular form of regularized reconstruc- 
tion error yields a reconstruction function that locally characterizes the shape of the 
data generating density. We show that the auto-encoder captures the score (derivative 
of the log-density with respect to the input), along with the second derivative of the 
density and the local mean associated with the unknown data-generating density. This 
is the second result linking denoising auto-encoders and score matching, but in way 
that is different from previous work, and can be applied to the case when the auto- 
encoder reconstruction function does not necessarily correspond to the derivative of 
an energy function. The theorems provided here are completely generic and do not 
depend on the parametrization of the auto-encoder: they show what the auto-encoder 
would tend to if given enough capacity and examples. These results are for a con- 
tractive training criterion we show to be similar to the denoising auto-encoder training 
criterion with small corruption noise, but with contraction applied on the whole re- 
construction function rather than just encoder. Similarly to score matching, one can 
consider the proposed training criterion as a convenient alternative to maximum like- 
lihood, i.e., one not involving a partition function. 



1 Introduction 



Machine learning is about capturing aspects of the unknown distribution from which the observed data 
are sampled (the data-generating distribution). For many learning algorithms and in particular in mani- 
fold learning, the focus is on identifying the regions (sets of points) in the space of examples where this 
distribution concentrates, i.e., which configurations of the observed variables are plausible. 

Unsupervised representation-learning algorithms attempt to characterize the data-generating distribu- 
tion through the discovery of a set of features or latent variables whose variations capture most of the 
structure of the data-generating distribution. In recent years, a number of unsupervised feature learning 
algorithms have been proposed that are based on minimizing some form of reconstruction error, such 
as auto-encoder and sparse coding variants (Olshausen and Field 1997} |Bengio et al 



2007 Ranzato 



eFori[2007]|Jain and Seu ng 2008 ; Ranzato et al. | |2008| |Vincent et al. [ 12008 ; Kavukcuoglu e t a7.[|2009| 



Rifai et al. 201 la|b||Gregor et al. 201 1 1. An auto-encoder reconstructs the input through two stages, an 
encoder function / (which outputs a learned representation h = f(x) of an example x) and a decoder 
function g, such that g(f(x)) m x for most x sampled from the data-generating distribution. These fea- 
ture learning algorithms can be stacked to form deeper and more abstract representations. Deep learning 
algorithms learn multiple levels of representation, where the number of levels is data-dependent. There 
are theoretical arguments and much empirical evidence to suggest that when they are well-trained, deep 



learning algorithms ( Hinton et al. | [2006 ; Bengio[ |2009| |Lee et al. \ [2009 Salakhutdinov and Hinton 
2009 Bengio and Delalleau 2011[ |Bengio et al. 2013 1 can perform better than their shallow coun- 
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terparts, both in terms of learning features for the purpose of classification tasks and for generating 
higher-quality samples. 

Here we restrict ourselves to the case of continuous inputs x € R d with the data-generating distribution 
being associated with an unknown target density function, denoted p. Manifold learning algorithms 



assume that p is concentrated in regions of lower dimension ( Cayton| [2005 ; Narayanan and Mitter 2010 1, 
i.e., the training examples are by definition located very close to these high-density manifolds. In that 
context, the core objective of manifold learning algorithms is to identify where the density concentrates. 

Some important questions remain concerning many of feature learning algorithms based on reconstruc- 
tion error. Most importantly, what is their training criterion learning about the input density! Do these 
algorithms implicitly learn about the whole density or only some aspect? The answers may help to estab- 
lish that these algorithms actually learn implicit density models, which only define a density indirectly, 
e.g., through the estimation of statistics or through a generative procedure. These are the questions to 
which this paper contributes. 

The paper is divided in two main sections, along with detailed appendices with proofs of the theorems. 
Section 2 makes a direct link between denoising auto-encoders ( ] Vincent et al.\ 2008 ) and contractive 



auto-encoders (Rifai et al. 201 la I, justifying the interest in the contractive training criterion studied 



in the rest of the paper. Section 3 is the main contribution and regards the following question: when 
minimizing that criterion, what does an auto-encoder learn about the data generating density! The 
main answer is that it estimates the score (first derivative of the log-density), i.e., the direction in which 
density is increasing the most, which also corresponds to the local mean, which is the expected value 
in a small ball around the current location. It also estimates the Hessian (second derivative of the log- 
density). 



2 Contractive and Denoising Auto-Encoders 



Regularized auto-encoders (see Bengio et al. (2012b) for a review and a longer exposition) capture the 



structure of the training distribution thanks to the productive opposition between reconstruction error 
and a regularizes An auto-encoder maps inputs x to an internal representation (or code) f(x) through 
the encoder function /, and then maps back f(x) to the input space through a decoding function g. The 
composition of / and g is called the reconstruction function r, with r(x) = g(f(x)), and a reconstruction 
loss function £ penalizes the error made, with r(x) viewed as a prediction of x. When the auto-encoder 
is regularized, e.g., via a sparsity regularizer, a contractive regularizer (detailed below), or a denoising 
form of regularization (that we find below to be very similar to a contractive regularizer), the regularizer 
basically attempts to make r (or /) as simple as possible, i.e., as constant as possible, as unresponsive to 
x as possible. It means that / has to throw away some information present in x, or at least represent it 
with less precision. On the other hand, to make reconstruction error small on the training set, examples 
that are neighbors on a high-density manifold must be represented with sufficiently different values of 
f(x) or r(x). Otherwise, it would not be possible to distinguish and hence correctly reconstruct these 
examples. It means that the derivatives of f(x) or r(x) in the ^-directions along the manifold must 
remain large, while the derivatives (of / or r) in the x-directions orthogonal to the manifold can be made 
very small. This is illustrated in Figure [T] below. In the case of Principal Components Analysis, one 
constrains the derivative to be exactly in the directions orthogonal to the chosen projection directions, 
and around 1 in the chosen projection directions. In regularized auto-encoders, / is non-linear, meaning 
that it is allowed to choose different principal directions (those that are well represented, i.e., ideally the 
manifold tangent directions) at different x's, and this allows a regularized auto-encoder with non-linear 
encoder to capture non-linear manifolds. Figure [2] illustrates the extreme case when the regularization 
is very strong (r(-) wants to be nearly constant where density is high) in the special case where the 
distribution is highly concentrated at three points (three training examples). It shows the compromise 
between obtaining the identity function at the training examples and having a flat r near the training 
examples, yielding a vector field r(x) — x that points towards the high density points. 



Here we show that the Denoising Auto-Encoder (| Vincent et al. 2008} with very small Gaussian cor- 



ruption and squared error loss is actually a particular kind of Contractive Auto-Encoder (|Rifai et aL\ 



201 la[ ), contracting the whole auto-encoder reconstruction function rather than just the encoder, whose 



contraction penalty coefficient is the magnitude of the perturbation. 
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X 

Figure 1 : Regularization forces the auto-encoder to become less sensitive to the input, but minimizing 
reconstruction error forces it to remain sensitive to variations along the manifold of high density. Hence 
the representation and reconstruction end up capturing well variations on the manifold while mostly 
ignoring variations orthogonal to it. 

r(x)t 




Figure 2: The reconstruction function r(x) (in green) which would be learned by a high-capcity autoencoder on 
a 1-dimensional input, i.e., minimizing reconstruction error at the training examples Xi (with r(xi) in red) while 
trying to be as constant as possible otherwise (large A). The figure is used to exagerate and illustrate the effect of 
the regularizer. The dotted line is the identity reconstruction (which might be obtained without the regularizer). The 
blue arrows shows the vector field of r(x) — x pointing towards high density peaks as estimated by the model, and 
estimating the score (log-density derivative), as shown in this paper. 



The Contractive Auto-Encoder or CAE (jRifai et ~al. 201 la i is a particular form of regularized auto- 
encoder which is trained to minimize the following regularized reconstruction error: 



C 



CAE 



= E 



£(x,r(x)) + A 



df(x) 


2 " 


dx 


F 



(1) 



where r(x) = g(f(x)) and is the sum of the squares of the elements of A. Both the squared loss 

£(x, r) — \\\x — r\\ 2 and the cross-entropy loss £(x 7 r) — — xlogr — (1 — x)\og(l — r) have been 
used, but here we focus our analysis on the squared loss because of the easier mathematical treatment it 
allows. Note that success in minimizing the CAE criterion strongly depends on the parametrization of / 
and g and in particular on the tied weights constraint used, with f(x) = sigmoid (VKx + b) and g(h) = 
sigmoid (W T h + c). The above regularizing term forces f (as well as g, because of the tied weights) 
to be contractive, i.e., to have singular values less than 1 y\ Larger values of A yield more contraction 
(smaller singular values) where it hurts reconstruction error the least, i.e., in the local directions where 
there are only little or no variations in the data. These typically are the directions orthogonal to the 
manifold of high density concentration, as illustrated in Fig. [2] 



The Denoising Auto-Encoder or DAE (Vincent et al. 2008) is trained to minimize the following denois- 
ing criterion: 

C DAE =E[£(x,r(N(x)))} (2) 

where N(x) is a stochastic corruption of x and the expectation is over the training distribution and the 
corruption noise source. Here we consider mostly the squared loss and Gaussian noise corruption, again 
because it is easier to handle them mathematically. 

Theorem 1. When using corruption noise N(x) = x + e with 

e ~ N (0, a 2 1) , 



'Note that an auto-encoder without any regularization would tend to find many leading singular values near 1 in 
order to minimize reconstruction error, i.e., preserve input norm in all the directions of variation present in the data. 
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the objective function Cdae is 



£>dae = 



1 



E 



•(aOH 



cr 2 E 



dr(x) 



dx 



+ o(a 2 ) 



0. 



The proof is in appendix and uses a simple Taylor expansion around x. 

This shows that the DAE with small corruption of variance a 1 is similar to a Contractive Auto-Encoder 
with penalty coefficient A = a 2 but where the contraction is imposed explicitly on the whole reconstruc- 
tion function r(-) = g (/(•)) rather than on /(•) alon^] 

This analysis motivates the study in the rest of this paper of the following training criterion: squared 
reconstruction loss plus contractive penalty on the reconstruction: 



C = E 



\r{x) 



A 



dr(x) 


2 " 


dx 


F_ 



(3) 



This is an analytic version of the denoising criterion with small noise A = a 2 , and also corresponds to a 
contractive auto-encoder with contraction on both / and g, i.e., on r. 



3 Minimizing the Loss to Recover Local Features of p(-) 
3.1 Solution from Calculus of Variations 

The central result of this paper is that in a non-parametric setting (without parametric constraints on r), 
the minimizer of the loss function defined by <j3j can be solved asymptotically as A — » 0. The exact 
meaning of this claim is made clearer in the following theorem. 

Theorem 2. Let p be a probability density function that is continuously differentiable once and with 
support R d (i.e. \/x £ M. d we have p(x) ^ 0). Let C\ be the loss function defined by 



p(x) 



\\r(x) — x\\ 2 + A 



dr(x) 


2 " 


dx 


F 



dx 



(4) 



for r : R d — > R d assumed to be differentiable twice, and < A £ i used as factor to the penalty term. 
Let r^(x) denote the optimal function that minimizes C\. Then we have that 



rl(x)=x + X dl ° g i P{x) +o(A) as A 
ox 

Moreover, we also have the following expression for the derivative 
dr* x (x) ^d 2 \ogp(x) 



0. 



dx 



I + X- 



dx 2 



+ o(A) as A -» 0. 



(5) 



(6) 



Both these asymptotic expansions are to be understood in a context where we consider {^MiA^ to 
be a family of optimal functions minimizing C\ for their corresponding value of A. The asymptotic 
expansions are applicable point-wise in x, that is, with any fixed x we look at the behavior as A — > 0. 

The proof is given in the appendix and uses the Euler-Lagrange equations from the calculus of variations. 

The idea that the scaling factor A > of the penalty term is brought to very small values is related to 
Theorem[T] where the scaling factor a is also studied for the asymptotic behavior as a — > 0. 

Consequently, we obtain an estimator of the score from 

d\ogp(x) 



dx 



(r(x)-x)/X + o(\) 



(7) 



2 In the CAE there is a also a contractive effect on g(-) as a side effect of the parametrization with weights tied 
between /(■) and <?(•). 
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3.2 Empirical Loss 



In an experimental setting, the expected loss Q is replaced by the empirical loss 

dr{x) 




r(iW) -x {n) 



2 

+ A 

2 



dx 



based on a sample l^-i drawn from p(x). 

Alternatively, the auto-encoder is trained online (by stochastic gradient updates) with a stream of exam- 
ples x( n \ which corresponds to performing stochastic gradient descent on the expected loss In both 
cases we obtain an auto-encoder that approximately minimizes the expected loss. 

An interesting question is the following: what can we infer from the data generating density when given 
an autoencoder reconstruction function r{x)l 

The premise is that this autoencoder r{x) was trained to approximately minimize a loss function that 
has exactly the form of ^ for some A > 0. This is assumed to have been done through minimizing the 

empirical loss and the distribution p was only available indirectly through the samples {x^ } n _ . We 
do not have access to p or to the samples. We have only r(x) and maybe A. 

We will now discuss the usefulness of r(x) based on different conditions such as the model capacity and 
the value of A. 

3.2.1 Perfect World Scenario 

As a starting point, we will assume that we are in a perfect situation, i.e., with no constraint on r (non- 
parametric setting), an infinite amount of training data, and a perfect minimization. We will see what can 
be done to recover information about p in that ideal case. Afterwards, we will drop certain assumptions 
one by one and discuss the possible paths to getting back some information about p. 

We use notation r\(x) when we want to emphasize the fact that the value of r(x) came from minimizing 
the loss with a certain fixed A. 

Suppose that r\ (x) was trained with an infinite sample drawn from p. Suppose also that it had infinite 
(or sufficient) model capacity and that it is capable of achieving the minimum of the loss function Q 
while satisfying the constraints that dT g^ is twice differentiable. Suppose that we know the value of A 
and that we are working in a computing environment of arbitrary precision (i.e. no rounding errors). 

As shown by Theorem^ we would be able to get numerically the values of d lQ gpO) a t any point x € R d 
by simply evaluating 

A ax 

In the setup described, we do not get to pick values of A so as to take the limit A — s- 0. However, it is 

ogp( 
dx 



assumed that A is already sufficiently small that the above quantity is close to 91 °sp( x ) f or a jj intents and 



puiposes. 

3.2.2 Simple Numerical Example 

To give an example of this in one dimension, we will show what happens when we train a non-parametric 
model f(x) to minimize numerically the loss (j4j) relative to p(x). 

The distribution p(x) studied is shown in Figure [5] and it was created to be simple enough to illustrate 
the mechanics. 

The model f(x) is fitted by dividing the interval [—10, 10] into M = 1000 partition points x\, . . . , xm 
evenly separated by a distance A. We then minimize numerically the loss function 



f(x i+ i) - r(xi) 



M M-1 

^2p(x l )A(r(x i ) - Xt) 2 + p(xi)A . 

i=l i=l x 
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(a) p(x) and (f(x) — x) / A (b) p(x) and q(x) oc exp ^ J 21 ' ^ - riy^ 



Figure 3: Left (a): the data generating density along with the estimated score r l a 0~ x _ Right (b): 
in the one-dimensional case, one can directed convert the estimated score into an estimated density q by 
simply summing the score from left to right, and we see visually that the estimator is excellent. 

Every value f(xi) for i = 1, . . . , M is treated as a free parameter. Setting to the derivative with respect 
to the f(xi) yields a system of linear equations in M unknowns. It can be solved exactly and this solution 
is plotted in Figure [3(a) 

In the one-dimensional case, the estimated score can be turned into a density by integrating: 
q(x) oc exp (^J2 X<X r ^ x ^ x ~ x ' ^ . As expected, we get into numerical instabilities as A becomes too small, 

but for A = 1CP 4 we can verify that max x . e [_ 10i io] \ p(x) — q(x)\ < 2.6 x 10~ 4 , and q is visually seen 
as a good fit to the data generating density p in Figure [3(b) 

This example supports the claim of Theorem [2] That is, the minimizer of the loss |4] behaves as 
x + \ 91o sp( x ) _|_ (_\) as A — > 0. This is illustrated by showing how the best numerical solution to 
a discretized version of the problem has exactly that behavior. 

3.2.3 Vector Field around a Manifold 

We extend experimentation to a 1 -dimensional manifold in 2-D space, in which one can visualize the 
vector field, and we go from the non-parametric estimator of the previous section to an actual auto- 
encoder trained by numerically minimizing the regularized reconstruction error. 

Two-dimensional data points (x, y) were generated along a spiral according to the following equations: 

x = 0.04sin(t), y = 0.04cos(i), t ~ Uniform (3, 12) 

A denoising auto-encoder was trained with Gaussian corruption noise a = 0.01. The encoder is f(x) = 
tanh(6-|- Wx) and the decoder is g{h) = c+Vh. The parameters (b, c, V, W) are optimized by BFGS to 
minimize the average squared error, using a fixed training set of 10 000 samples (i.e. the same corruption 
noises were sampled once and for all). We found better results with untied weights, and BFGS gave more 
accurate models than stochastic gradient descent. We used 1000 hiddens units and ran BFGS for 1000 
iterations. 

Figure [4] shows the data along with the learned score function (shown as a vector field), for two trained 
models, with different initial conditions and optimization hyper-parameters (zooming closer inside, for 
the right hand side panels). We see that that the vector field points towards the nearest high-density point 
on the data manifold, and that the sampling procedure does a good job of estimating the underlying 
distribution. The vector field is close to zero near the manifold (i.e. the reconstruction error is close to 
zero), also corresponding to peaks of the implicitly estimated density. The points on the manifolds play 
the role of sinks for the vector field. Other places where reconstruction error may be low but where the 
implicit density is not high are sources of the vector field, however, these are rarely present (e.g., not in 
the top row model). 
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Figure 4: The original 2-D data from the data generating density p(x) is plotted along with the vector 
field defined by the values of r(x) — x for trained autoencoders (corresponding to the estimation of the 
). Top plots are for one model and bottom plots for another. The left and right plots 



dx 



correspond to different levels of zooming. 



In most of the good models trained (as in Figure 4(a) I, we can see that, from far away, the spiral is 
attractive. However, in a particular instance (Figure |4(c)] i, a model that got a very accurate vector field 
most accurately inside the spiral happened to have unexpected spurious attractors outside of the spiral. 



This can be seen in Figure 4(c) 



Previous work (Rifai et al.,2Q\2, Bengio et al. |2013 1 has already shown that contractive auto-encoders 
(especially when they are stacked in a way similar to RBMs in a Deep Belief Net) learn good models of 
high-dimensional (such as images), and that these models can be used not just to obtain good representa- 
tions for classification tasks but that good quality samples can be obtained from the model, by a random 
walk near the manifold of high-density. 



3.2.4 Missing A 



When we are in the same setting as in section 3.2.1 but the value of A is unknown, we can modify ([8]) a 
bit and avoid dividing by A. That is, for a trained reconstruction function r(x) given to us we just take 
the quantity r(x) — x and it should be approximatively the score up to a multiplicative constant. 



r(x) 



d\ogp(x) 
dx 



Equivalently, if one estimates the density via an energy function (minus the unnormalized log density), 
then x — r(x) estimates the gradient of the energy function. 
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We still have to assume that A is small. Otherwise, if the unknown A is too large we might get a poor 
estimation of the score. 



3.2.5 Relation to Denoising Score Matching 

Naturally, the driving assumption behind the above statements is still that r(x) minimizes a loss of the 
form (|4ji for some p(x) and A. If it comes from any other procedure, or if it is limited in capacity (because 
we are in a parametric setting not including or close enough to the target function) and cannot achieve 
the minimum of (j4j), then more work is needed to be able to provide formal guarantees. 

However, a reassuring and very related result was obtained by |Vincent| ( |201 l| l. Motivated by the analysis 
of denoising auto-encoders, it is concerned with the case where we explicitly parametrize an energy 
function £(x), yielding an associated score function ip(x) = — 9£ g^ and we stochastically corrupt the 
original samples x ~ p to obtain noisy samples x ~ q a (x\x). In particular, the article analyzes the 
case where q a adds Gaussian noise of variance a 2 to x. The main result is that minimizing the expected 
square difference between ip(x) and the score of q a (x\x), 

tp n\if~\ dl °SQa(x\x) 2 
ExM\m x ) q~ II h 



is equivalent to performing score matching (Hyvarinen 2005 i with estimator ip(x) and target density 
q a (x) = J q a {x\x)p{x)dx, where p(x) generates the training samples x. Note that when a finite training 
set is used, q a {x) is simply a smooth of the empirical distribution (e.g. the Parzen density with Gaussian 

kernel of width a). When the corruption n 
that if we define a reconstruction function 



kernel of width a). When the corruption noise is Gaussian, q "^ x '> = x -^-, from which we can deduce 



r(x) = x + <J 2 ip(x), (9) 
then the above expectation is equivalent to 

E^if^f^ - ^|| 2 ] = \E^[\\r{x) - x\\ 2 ] 
a (J a 

which is the denoising criterion. This says that when the reconstruction function r is parametrized so 
as to correspond to the score ^ of a model density (as per eq. [9] and where rp is a derivative of some 
log-density), the denoising criterion on r with Gaussian corruption noise is equivalent to score matching 
with respect to a smooth of the data generating density, i.e., a regularized form of score matching. Note 
that this regularization appears desirable, because matching the score of the empirical distribution (or 
an insufficiently smoothed version of it) could yield undesirable results when the training set is finite. 
Since score matching has been shown to be a consistent induction principl e (|Hy varinen| |2005 1, it means 



that this denoising score matching (Vincent 2011 Kingma and LeCun} 2010| |Swersky et al.\ 201 1] > 



criterion recovers the underlying density, up to the smoothing induced by the noise of variance a z . By 
making a 2 small, we can make the estimator arbitrarily good (and we would expect to want to do that 
as the amount of training data increases). Note the correspondance of this conclusion with the results 
presented here, which show that (1) A = a 2 and (2) that minimizing the equivalent analytic criterion 
(based on a contraction penalty) estimates the score when A is small. The difference is that our result 
holds even when r is not parametrized as per eq. [9] i.e., is not forced to correspond with the score 
function of a density. 

3.3 Estimating the Hessian 

Since we have r ^>~ x a s an estimator of the score, we readily obtain that the Hessian of the log-density, 
can be estimated by the Jacobian of the reconstruction function minus the identity matrix: 



d 2 \ogp(x) dr(x) 
~dx~ 2 { ^x~~ I)/X 



as shown by equation |6]) of Theorem [2] 

Besides first and second derivatives of the density, other local properties of the density are its local mean 



and local covariance, discussed in the Appendix, section 5.3 
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4 Conclusion 



Whereas auto-encoders have long been suspected of capturing information about the data generating 
density, this work has clarified what some of them are actually doing, showing that they can actually 
implicitly recover the data generating density altogether. We have shown that regularized auto-encoders 
such as the Denoising Auto-Encoder and a form of Contractive Auto-Encoder are closely related to each 
other and estimate local properties of the data generating density: the first derivative (score) and second 
derivative of the log-density, as well as the local mean. Our results do not require the reconstruction func- 
tion to correspond to the derivative of an energy function as in Vincent ( 201 1) , but hold simply by virtue 
of minimizing the regularized reconstruction error training criterion. This suggests that minimizing a 
regularized reconstruction error may be an alternative to maximum likelihood for unsupervised learning, 
avoiding the need for MCMC in the inner loop of training, as in RBMs and Deep Boltzmann Machines, 



analogously to Score Matching (Hyvarinen 2005] Vincent 2011 1. Toy experiments has confirmed that 
a good estimator of the density can be obtained when this criterion is non-parametrically minimized. 

Many questions remain open and deserve futher study. A big question is how to generalize these ideas 
to discrete data, since we have heavily relied on the notions of scores, i.e., of derivatives with respect to 
x. 

We have mostly considered the harder case where the auto-encoder parametrization does not guarantee 
the existence of an analytic formulation of an energy function. It would be interesting to compare experi- 
mentally and study mathematically these two formulations to assess how much is lost (because the score 
function may be somehow inconsistent) or gained (because of the less constrained parametrization). 



Another interesting question, following up from|Rifai et al. (2012 1 and Bengio et al. ( 2012a i, is how to 



exploit the score estimator to sample from the implicit density estimator captured by the auto-encoder. 
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5 Appendix 

5.1 Relationship between Contractive Penalty and Denoising Criterion 
Theorem 1. When using corruption noise N(x) = x + e with 



e ~ M (0, a 2 1) 



the objective function C dae is 



1 



C DAE =2 {E[\\x-r(x)\\ 2 ] + a 2 E 



dr(x) 



dx 



+ o{a 2 ) 



as a — > 0. 

The proof is in appendix and uses a simple Taylor expansion around x. 
Proof. With a Taylor expansion around x we have that 

r(x + e) = r(x) + d li x l e + o(a 2 ). 



Substituting this into C dae we have that 

t-'DAE = E 



(r( 



dx 



o(a 2 )) 



= § (E [||* - r(x)|| 2 ] - 2E[e] T E [^f (x r{x))\ ) 



+ 



\Tr (E [ee T ] 



E 



dx 



UE[\\x-r(x)\\ 2 ]+a 2 E 



dr(x) 




dx 


) 


dr(x) 


2 


dx 


F 



o(a 2 ) 
+ o(a 2 ) 



(10) 



where in the second line we used the independance of the noise from x and properties of the trace, while 
in the last line we used E [ee T ] = a 2 I and E[e] =0 by definition of e. □ 
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5.2 Calculus of Variations 

Theorem 2. Let p be a probability density function that is continuously differentiate once and with 
support M. d (i.e. \/x £ M. d we have p(x) 7^ 0). Let C\ be the loss function defined by 



p(x) 



\\r(x) — x\\ 2 + X 



dr(x) 


2 " 


dx 


F_ 



dx 



for r : M. d — » M. d assumed to be differentiate twice, and < A G i used as factor to the penalty term. 
Let r* x (x) denote the optimal function that minimizes C\. Then we have that 

rXlx) =x + X dl0gP ^ x) + (A) as A -> 0. 
ox 

Moreover, we also have the following expression for the derivative 
d r \( x ) t , ,d 2 \ogp(x 



I + X- 



+ o(A) as X ->• 0. 



dx dx 2 

Both these asymptotic expansions are to be understood in a context where we consider {?'a(' i ')}a>o to 
be a family of optimal functions minimizing C\ for their corresponding value of X. The asymptotic 
expansions are applicable point-wise in x, that is, with any fixed x we look at the behavior as X — > 0. 



Proof. This proof is done in two parts. In the first part, the objective is to get to equation ( 13 1 that has to 
be satisfied for the optimum solution. 

We will leave out the A indices from the expressions involving r(x) to make the notation lighter. We 
have a more important need for indices k in rk(x) that denote the d components of r(x) £ K d . 

We treat A as given and constant for the first part of this proof. 

In the second part we work out the asymptotic expansion in terms of A. We again work with the implicit 
dependence of r(x) on A. 

(part 1 of the proof) 

We make use of the Euler-Lagrange equation from the Calculus of Variations. We would refer the reader 
to either ( |Daco rogna , 2004) or Wikipedia for more on the topic. Let 



f{xi, ...,x n ,r,r Xl ,.- -,r Xn ) = p{x) 



\\r(x) — x\\ 2 + X 



dr(x) 


2 " 


dx 


F 



where x = (xi, ...,x d ), r(x) = (n(x), . . . ,r d (x)) and r Xi = 
We can rewrite the loss C(r) more explicitly as 



C(r) 



p{x) 



X) — X 



d „ 



d d „ , « 2 



dxj 

i=l j=l 3 



dx 



(n{x) - Xl f 2 + xj2 dri[x)2 



3=1 



dxi 



dx 



to observe that the components ri(x), . . . , rd(x) can each be optimized separately. 

The Euler-Lagrange equation to be satisfied at the optimal r : M. d — > M. d is 

df 
Or 



df A d df 



dxi dr„ 



(11) 



In our situation, the expressions from that equation are given by 

df 



di 



2(r(x) — x)p(x) 
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df 

dr Xi 



2Xp{x) 



d f df 



dxi \dr Xi 



2X 



dp(x) 
dxi 

-2Xp(x) 



dx i 




dr\ 


dr 2 


dxi 


dxi 


a 2 n 


d 2 r 2 


dx 2 


dx 2 



dr d 



dr d 
dx > 



d 2 r d 
dx 2 



and the equality to be satisfied at the optimum becomes 



(r(x) - x)p(x) = \ 



OXi OXi 1 v / ox 



OXi OXi ~\ / ox. 



(12) 



As equation (Hi hinted, the expression ( 12 1 can be decomposed into the different components r k (x) 



. that make r. For k= 1, . . . , d we get 



(r k (x) - x k )p(x) = A£ ( WW*) +p{x ^ 2 r k (x) 



dxi dxi 



dxf 



As p(x) ^ by hypothesis, we can divide all the terms by p(x) and note that /p[x) — dlo sp( x ) 



dxi 



We get 



(a;) - x k = A 



51ogp(x) 5r fc (x) d 2 r k (x) 



dx, 



dxi 



dx] 



(13) 



This first thing to observe is that when A = the solution is just r k (x) — x k , which translates into 
r(x) = x. This is not a surprise because it represents the perfect reconstruction value that we get when 
we the penalty term vanishes in the loss function. 



(part 2 of the proof) 



This linear partial differential equation (13 1 can be used as a recursive relation for r k (x) to obtain a 
Taylor series in A. The goal is to obtain an expression of the form 



r(x) = x + Xh(x) + o(A) as A^O 
where we can solve for h(x) and for which we also have that 

d 4^=I + X df ^ + o(X) as A^O. 
dx dx 



(14) 



We can substitute in the right-hand side of equation ( 14 1 the value for r k (x) that we get from equation 
( 14 1 itself. This substitution would be pointless in any other situation where we are not trying to get a 
power series in terms of A around 0. 
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r k (x) = x k 



= %k 




d\ogp(x) dr k (x) d 2 r k {x) 



dxi 



dxi 



+^E 



d 



d\ogp{x) d 
dxi dxi 



d 2 r k (x) 

dX 1 : 



Xk 



d 

A E 

3=1 



dlogp(x) dr k {x) d 2 r k (x) 



dxj 



<= 1 
d 



dxi 



i=l 3=1 



<9a;, 9xi 



.Tfe +A 



<91ogp(a;) 
dx k 



A E 



d 2 r fc (x) 
dx 2 



dxj dxj 



+ X 2 p(X,x) 



8x4 



dx 2 



d\ogp(x) d I d\ogp{x) dr k {x) d 2 r k (x) 



dx 2 



(15) 
(16) 
(17) 
(18) 
(19) 
(20) 



Now we would like to get rid of that A Ylt—i 9 g^T^ *- erm by showing that it is a term that involves only 

powers of A 2 or higher. We get this by showing what we get by differentiating the expression for r k (x) 
in line ( 20 1 twice with respect to some I. 



dr k (x) 
dxi 



(i = l) + } p 2 ______ + a— 

dxidx k dxi 



E 



d 2 r k (x) 
dx 2 



+ Xp(X, x) 



d 2 r k (x) _ ^ d 3 \ogp{x) ^ d 
dxf dxfdx k dxj 



E 

v.*=l 



d 2 r k (x 



dx 



Since A is a common factor in all the terms of the expression of 

d\ogp(x) 



- Xp(X, x)j 
9 we get what we needed. That is, 



This shows that 
and 

which completes the proof. 
5.3 Local Mean 



r k (x) = x k + X " "q^^' + A 2 f?(A, x). 
d\ogp(x) 



r{x) = x + X- 



dx 



dx k 

+ o(A) as A 



dr(x) ,d 2 \ogp(x) 

K '=I + X + o(A) as A^O 



dx 2 



□ 



In preliminary work ( Bengio et al. |2012a I, we studied how the optimal reconstruction could possibly 
estimate so-called local moments. We revisit this question here, with more appealing and precise results. 

What previous work on denoising and contractive auto-encoders suggest is that regularized auto- 
encoders can capture the local structure of the density through the value of the encoding (or recon- 
struction) function and its derivative. In particular, Rifai et al. ( 2012| l; Bengio et q/.| (2012a) argue that 
the first and second derivatives tell us in which directions it makes sense to randomly move while pre- 
serving or increasing the density, which may be used to justify sampling procedures. This motivates us 
here to study so-called local moments as captured by the auto-encoder, and in particular the local mean, 
following the definitions introduced in Bengio et al. ( 2012a i. 
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5.3.1 Definitions for Locai Distributions 



Let p be a continuous probability density function with support M. d . That is, Va; E R d we have that 
p(x) 7^ 0. We define below the notion of a local ball Bs(xq), along with an associated local density, 
which is the normalized product of p with the indicator for the ball: 

Bs(xo) = {x s.t. ||x — xo|| 2 < 5} 
p(x)dx 



Zs(x ) = 
ps{x\x ) = 



'■s(xo) 

1 



Zs(x ) 



P (x)I(x e B s (x )) 



where Zs(xo) is the normalizing constant required to make ps(x\xo) a valid pdf for a distribution cen- 
tered on xo- The support of ps(x\xo) is the ball of radius 6 around x denoted by B$(xo). We stick 
to the 2-norm in terms of defining the balls Bs(xq) used, but everything could be rewritten in terms of 
another p-norm to have slightly different formulas. 

We use the following notation for what will be referred to as the first two local moments (i.e. local mean 
and local covariance) of the random variable described by ps(x\x Q ). 



m s (x ) 



def 



xps(x\xo)dx 



def 



Cs(x ) = / (x - m s (x ))(x - m s (x )) p s (x\x )dx 
Based on these definitions, one can prove (in appendix) the following theorem. 

Theorem 3. Let p be of class C 3 and represent a probabdity density function. Let xq £ K d with 
p(xo) 0. Then we have that 



ms(x ) = x + 5 



2 1 dlogp(x) 
d+2 dx 



+ o(5 3 ) 



This links the local mean of a density with the score associated with that density. Combining this theorem 
with Theorem[2] we obtain that the optimal reconstruction function r* (•) also estimates the local mean: 



m s (x)-x = j——{r*{x)-x) +A(6) + 5 2 B(X) 



(21) 



for error terms A(S), B(X) such that 



A(S) € o(S 3 ) as 6->0, 
B(A) e o(l) as A^O. 

This means that we can loosely estimate the direction to the local mean by the direction of the recon- 
struction: 

n%s(x) — x oc r*(x) — x. (22) 

5.4 Asymptotic formuias for localised moments 

Proposition 1. Let p be of class C 2 and let xq £ Mr. Then we have that 



Zs(x Q ) = S d 



T d/2 



where H(xq) 



d 2 p(x) 
dx 2 



r(i + rf/2) 

Moreover, we have that 



2(d + 2) 



1 =§ _ d r(i + d/2) 



Zs(x ) 



T d/2 



1 -^M + o(*») 



p(x ) p(x ) 2 2{d + 2) 
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Proof. 

Zs(x ) 



pOo) 



dp(x) 



dx 



(x - x Q ) + — (x - x ) T H(x )(x - x ) 



+ |£ (3) pM(x-x ) + o(<5 3 ) 



dx 



p(x )8 d 
6 



Bg(x ) 

n d/2 



dx + + — / (x — xq ) T H(x )(x- x )dx + + o(S d+3 ) 



Bs(x ) 
-jrd/2 



r(l + d/2) 4r(2 + d/2) 

^/2 



T* (#(*<>)) 



r(i + d/2) 



We use Proposition [3] to get that trace come up from the integral involving H(xq). The expression for 
1/Z$(xa) comes from the fact that, for any a, b > we have that 



a + b5 2 +o(S 3 ) l + ^ 2 +o(<P) a 



' r i-(-6 2 + o(5 3 ))+o(5 4 ) 



1 b 



a a* 



S 2 + o(S 3 ) as S -> 0. 



by using the classic result from geometric series where = 1 
Now we just apply this to 



for \r\ < 1. 



1 



Zs(x ) 
and get the expected result. 



6 



_ d r(i + d/2) 



1 



rd/2 



□ 



Theorem 4. Lef p fee of class C 3 and represent a probability density function. Let xq € K d wi/n 
p(xo) 7^ 0. 7ne« we /lave that 



ms(x ) = Xq + 8 



., 1 d\ogp{x) 



d + 2 dx 



o(S 3 



Proof. The leading term in the expression for ms(xo) is obtained by transforming the x in the integral 
into a x — xq to make the integral easier to integrate. 

ms(x ) = ) r / xp(x)dx = x H 7-^/ (x - x )p(x)dx. 

^Sl^oj JBs(xa) z s (x a ) Jb s {xo) 

Now using the Taylor expansion around xq 



m<s(xo) = x + 



^(^o) 



Bs(io) 



(x - Xq) 



<9p(a 



dx 



(x - x ) 



+ 2^-^ ^ 



(x - x ) + o(\\x - x || ) 



dx. 
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Remember that f Bg r Xo ) f(x)dx — whenever we have a function / is anti-symmetrical (or "odd") 
relative to the point xq (i.e. f(x — xq) = f(—x — Xq)). This applies to the terms (x — xo)p(xo) and 
[x — Xq)(x — Xo) 9 1^ (x — xq) t . Hence we use Proposition 2 to get 



m s (x Q ) = x + 



x 



1 

Zs(xo) JBs(x ) 
1 I Xd+2 



(X ~ Xq) 



dx 
dp(x) 



Zs(xo) 



2T (2+|) y dx 



(x - Xq) + 0(\\x - X \\ ) 



+ o(5 3 ). 



dx 



Now, looking at the coefficient in front of dP Q^ 
it as 



in the first term, we can use Proposition 1 to rewrite 



1 



7T 2 



S d + 2 " I = s 

Zs(x Q ) v 2r(2+f) J 



_ d r (i + rf/2) 

TTd/2 



1 2 1 Tr(H(x )) 3 
pjxo) p(x )i 2(d+2) + 1 ] 



rd+2 



7T 2 



2r (2 



r (i + - 

£2 L \ L ^ 2 



2T (2 + f ) 



1 



5 2 1 Tr(g(x )) 

p(a? ) P(^o) 2 2(d + 2) 



(<5 3 ) 



= 5' 



1 1 

p(x ) d + 2 



There is no reason the keep the — 5 4 2 X , 1 -, 2 Tr ^M^ m the above expression because the asymp- 

totic error from the remainder term in the main expression is o(5 3 ). That would swallow our exact 
expression for <5 4 and make it useless. 

We end up with 



m s {x ) = x + 8 Z 



1 d\ogp(x) 



d + 2 dx 



o(6 3 



□ 



5.5 Integration on balls and spheres 

This result comes from Multi-dimensional Integration : Scary Calculus Problems from Tim Reluga (who 
got the results from How to integrate a polynomial over a sphere by Gerald B. Folland). 

Theorem 5. Let B = |a; € R d /Cj=i x? < l \ be the ball of radius 1 around the origin. Then 



nr p 



a.j+1 



2 

for any real numbers aj > 0. 

Corollary 1. Let B be the ball of radius 1 around the origin. Then 



nr 



\\x a /dx= I r(i+#+|E«i) 



if all the Qjare even integers 



j=i I otherwise 

for any non-negative integers aj > 0. Note the absence of the absolute values put on the x'j 3 terms. 
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Corollary 2. Let Bg(0) C K d be the ball of radius S around the origin. Then 



T x 1 ; 1 dx = 







nrK 



ofnerw/se 



//aZZ f/ie a^are even integers 



for any non-negative integers a,j > 0. Afofe the absence of the absolute values on the x°j J terms. 

Proof. We take the theorem as given and concentrate here on justifying the two corollaries. 

Note how in Corollary[T]we dropped the absolute values that were in the original Theorem[5] In situations 
where at least one a,j is odd, we have that the function f(x) = Ylj=i x °j becomes odd in the sense that 
f(—x) — —f(x). Because of the symmetrical nature of the integration on the unit ball, we get that the 
integral is as a result of cancellations. 

For Corollary |5j we can rewrite the integral by changing the domain with yj = xj/5 so that 



5- £ a i 



TT x® 3 dx 



TT {xj/d)^ dx 

B s (0)fJi 



T\v a ^ d dy. 



We pull out the S d that we got from the determinant of the Jacobian when changing from dx to dy and 
Corollary [2] follows . 



□ 



Proposition 2. Let v £ M. d and let Bg(0) C R d be the ball of radius 5 around the origin. Then 

y <v,y > dy 



rd+2 



s(0) 

where < v,y > is the usual dot product. 
Proof. We have that 



2r 2 



y <v,y > 



v dVd 



which is decomposable into d component-wise applications of Corollary [2] This yields the expected 
result with the constant obtained from = = \\f^- 



□ 



Proposition 3. Let H € R dxd and let Bs(xq) C R d be the ball of radius 5 around x g R d . Then 
/ (x — Xq) t H(x — Xo)da 

JB s (x„) 



■jrd/2 

ix = 5 d+2 ; —trace (H) . 

2r(2 + d/2) 1 ; 



Proof. First, by substituting y = [x — Xq) /6 we have that this is equivalent to showing that 



(0) ^ g ^ = 2 r(2 + d/2) trace(g) 



This integral yields a real number which can be written as 
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/ V T Hydy= V y.H.^jdy = V / ll,ll,ll,.,<h}- 

J 3,(0) J 3,(0) —. —. J Bl (0) 

Now we know from Corollary [2]that this integral is zero when i ^ j. This gives 

£ H tJ / Bi(Q) yi y jd y = £ y*dy = trace (fl) 2J ^ W 



□ 
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